In the Multi-Armed Bandit (MAB) problem, there is a given set of arms
with unknown reward models. At each time, a player selects one arm to play,
aiming to maximize the total expected reward over a horizon of length T. An
approach based on a Deterministic Sequencing of Exploration and Exploitation
(DSEE) is developed for constructing sequential arm selection policies.
It is shown that when the moment-generating functions of the arm reward
distributions are properly bounded around zero, the optimal logarithmic order
of the regret (defined as the total expected reward loss against the ideal
case with known reward models) can be achieved by DSEE. The condition on
the reward distributions can be gradually relaxed at a cost of a higher
(nevertheless, sublinear) regret order: for any positive integer p, $O(T^{1/p})$
regret can be achieved by DSEE when the moments of the reward distributions exist
(only) up to the pth order. The proposed DSEE approach complements existing
work on MAB by providing corresponding results under a set of relaxed
conditions on the reward distributions. Furthermore, with a clearly defined
tunable parameter, the cardinality of the exploration sequence, the DSEE
approach is easily extendable to variations of MAB, as demonstrated by its
generalization to MAB with various objectives and decentralized MAB with
multiple players and corrupted reward observations.



1. Introduction. 

1.1. Multi-Armed Bandit. Multi-armed bandit (MAB) is a class of sequential
learning and decision problems with unknown models. In the classic MAB, there 
are N independent arms and a single player. At each time, the player chooses one 
arm to play and obtains a random reward drawn i.i.d. over time from an unknown 
distribution. Different arms may have different reward distributions. The design 
objective is a sequential arm selection policy that maximizes the total expected 
reward over a long but finite horizon T. The MAB problem finds a wide range 
of applications including clinical trials, target tracking, dynamic spectrum access,
Internet advertising and Web search, and social economical networks (see [1, 2, 3]
and references therein). 

In the MAB problem, each received reward plays two roles: increasing the 
wealth of the player, and providing one more observation for learning the reward 
statistics of the arm. The tradeoff between exploration and exploitation is thus
clear: which role should be emphasized in arm selection — an arm less explored 
thus holding potentials for the future or an arm with a good history of rewards? In 
1952, Robbins addressed the two-armed bandit problem [1]. He showed that the 
same maximum average reward achievable under a known model can be obtained 
by dedicating two arbitrary sublinear sequences for playing each of the two arms. 
In 1985, Lai and Robbins proposed a finer performance measure, the so-called regret,
defined as the expected total reward loss with respect to the ideal scenario of
known reward models (under which the arm with the largest reward mean is always
played) [4]. Regret not only indicates whether the maximum average reward under 
known models is achieved, but also measures the convergence rate of the average 
reward, or the effectiveness of learning. Although all policies with sublinear regret 
achieve the maximum average reward, the difference in their total expected reward 
can be arbitrarily large as T increases. The minimization of the regret is thus of 
great interest. Lai and Robbins showed that the minimum regret has a logarithmic 
order in T. Furthermore, for Gaussian, Bernoulli, Poisson and Laplacian distributions,
policies were explicitly constructed to achieve this minimum regret¹. Since
the seminal work by Lai and Robbins, simpler index-type policies were developed
by Agrawal in 1995 [5] and Auer et al. in 2002 [6]. These policies achieve the 
logarithmic regret order under different conditions on the reward distributions. 

In the classic policies developed by Lai and Robbins [4], Agrawal [5] and Auer et 
al. [6], arms are prioritized according to two statistics: the sample mean $\bar{\theta}(t)$
calculated from past observations up to time t and the number $\tau(t)$ of times that the
arm has been played up to t. The larger $\bar{\theta}(t)$ is or the smaller $\tau(t)$ is, the higher
the priority given to this arm in arm selection. The tradeoff between exploration 
and exploitation is reflected in how these two statistics are combined together for 
arm selection at each given time t. This is most clearly seen in the UCB1 (Upper
Confidence Bound) policy proposed by Auer et al. in [6], in which an index $I(t)$
is computed for each arm and the arm with the largest index is chosen. The index 
(referred to as the upper confidence bound) has the following simple form: 



(1.1)    $I(t) = \bar{\theta}(t) + \sqrt{\frac{2 \log t}{\tau(t)}}$.

This index form is intuitive in the light of Lai and Robbins's result on the
logarithmic order of the minimum regret, which indicates that each arm needs to be
explored on the order of log t times. For an arm sampled at a smaller order than
log t, its index, dominated by the second term, will be sufficiently large for large t to
ensure further exploration.

¹ For the existence of an optimal policy in general, Lai and Robbins established a sufficient
condition on the reward distributions. However, the condition is difficult to check and is only
verified for the specific distributions mentioned above.
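The UCB1 index discussed above can be sketched in a few lines. This is a minimal illustration, assuming the standard form of the index from Auer et al. (sample mean plus $\sqrt{2 \log t / \tau(t)}$); the function names `ucb1_index` and `select_arm`, the 0-based arm numbering, and the convention that an unplayed arm receives an infinite index are assumptions made here, not part of the paper.

```python
import math


def ucb1_index(sample_mean: float, plays: int, t: int) -> float:
    """Sample mean plus the exploration bonus sqrt(2 log t / plays)."""
    if plays == 0:
        # Assumption: an arm that has never been played is tried first.
        return math.inf
    return sample_mean + math.sqrt(2.0 * math.log(t) / plays)


def select_arm(sample_means, play_counts, t):
    """Return the 0-based arm with the largest index at time t."""
    indices = [ucb1_index(m, c, t) for m, c in zip(sample_means, play_counts)]
    return max(range(len(indices)), key=indices.__getitem__)
```

An arm with a slightly lower sample mean but far fewer plays wins the comparison, which is the exploration effect of the second term described in the text.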

1.2. Deterministic Sequencing of Exploration and Exploitation. In this paper, 
we develop a new approach to the MAB problem. Based on a Deterministic Se- 
quencing of Exploration and Exploitation (DSEE), this approach differs from the 
classic policies proposed in [4, 5, 6] by separating in time the two objectives of 
exploration and exploitation. Specifically, time is divided into two interleaving se- 
quences, in one of which, arms are selected for exploration, and in the other, for 
exploitation. In the exploration sequence, the player plays all arms in a round-robin 
fashion. In the exploitation sequence, the player plays the arm with the largest sam- 
ple mean. Under this approach, the tradeoff between exploration and exploitation is 
reflected in the cardinality of the exploration sequence. It is not difficult to see that 
the regret order is lower bounded by the cardinality of the exploration sequence 
since a fixed fraction of the exploration sequence is spent on bad arms. Never- 
theless, the exploration sequence needs to be chosen sufficiently dense to ensure 
effective learning of the best arm. Otherwise, the regret will be dominated by the 
reward loss in the exploitation sequence caused by incorrectly identified arm rank. 
The key here is thus finding the minimum cardinality of the exploration sequence 
that ensures a reward loss in the exploitation sequence having an order no larger 
than the cardinality of the exploration sequence. 

We show that when the moment-generating functions of the reward distributions 
are properly bounded around zero, DSEE achieves the optimal logarithmic order of 
the regret using an exploration sequence with $O(\log T)$ cardinality. The condition
on the reward distributions can be gradually relaxed at a cost of a higher (nevertheless,
still sublinear) regret order. Specifically, we show that for any p, when the
moments of the reward distributions only exist up to the pth order, $O(T^{1/p})$ regret
can be achieved using an exploration sequence with $O(T^{1/p})$ cardinality. This result
reveals an interesting dependency of the regret on the tail probabilities of the
reward distributions: a denser exploration sequence is needed when the reward dis- 
tributions have heavier tails (which makes learning more difficult). In all cases, the 
regret is sublinear; the maximum average reward defined by the ideal scenario of 
known reward models is achieved. 

Compared to the classic policies proposed in [4, 5, 6] that focus on some spe- 
cific reward distributions in the exponential family [4, 5] or those with finite sup- 
port [6], DSEE offers corresponding results under a set of relaxed conditions on 
the reward distributions. More specifically, the condition on the reward distributions
for achieving the optimal logarithmic regret order is more general than those
assumed in [4, 5, 6]. Furthermore, DSEE offers the possibility of sublinear regret 
for reward distributions with heavy tails. A distinct feature of the DSEE approach 
is that it has a clearly defined tunable parameter (the cardinality of the exploration
sequence) which can be adjusted according to the "hardness" (in terms of learning)
of the reward distributions and the observation models. It is thus more easily
extendable to handle variations of MAB as discussed in the next subsection. 

We point out that for both the classic policies in [4, 5, 6] and the DSEE policies 
developed in this paper, certain knowledge on the reward distributions is needed 
for policy construction. In particular, the policies proposed in [4, 5] require the
knowledge of the distribution type (e.g., Gaussian or Laplacian) of each arm. The 
policies proposed in [6] require that the reward distributions have finite support 
with a known support range. While DSEE achieves the optimal logarithmic regret 
order for a larger set of reward models, it requires a positive lower bound on the 
difference in the reward means of the best and the second best arms. This can be 
a more demanding requirement than the distribution type or the support range of 
the reward distributions. By increasing the cardinality of the exploration sequence,
however, we show that DSEE achieves a regret arbitrarily close to the logarithmic 
order without any knowledge of the reward model. We further emphasize that the 
sublinear regret for reward distributions with heavy tails is achieved without any 
knowledge of the reward model (other than a lower bound on the order p of the 
highest finite moment). 

1.3. Extendability to Variations of MAB. In the previous subsection, we emphasized
the agility of the DSEE approach in handling reward distributions with
heavy tails and the lack of any prior knowledge through the adjustment of the 
cardinality of the exploration sequence. In this subsection, we show that the deter- 
ministic separation of exploration from exploitation allows easy extendability of 
DSEE to decentralized MAB problems. 

In a decentralized MAB, there are M (M < N) players. At each time, each
player chooses one arm to play. When multiple players choose the same arm, the 
reward offered by the arm is distributed arbitrarily among the players, not neces- 
sarily with conservation. Such an event is referred to as a collision. Players are dis- 
tributed: actions and rewards of other players are unobservable, and no information 
can be exchanged among players. As a consequence, collisions are unobservable; a 
player does not know whether it is involved in a collision, or equivalently, whether 
the received reward reflects the true state of the arm. Collisions thus not only result 
in immediate reward loss, but also corrupt the observations that a player relies on 
for learning the arm rank. 

The deterministic separation of exploration and exploitation in DSEE, however,
can ensure that collisions are contained within the exploitation sequence. Learning
in the exploration sequence is thus carried out using only reliable observations.
Specifically, in the exploration sequence, players play all arms in a round-robin 
fashion with different offsets which can be predetermined based on, for example, 
the players' IDs, to eliminate collisions. In the exploitation sequence, each player 
plays the M arms with the largest sample means calculated using only observations 
from the exploration sequence under either a prioritized or a fair sharing scheme. 
While collisions still occur in the exploitation sequence due to the difference in the 
estimated arm rank across players caused by the randomness of the sample means, 
their effect on the total reward can be limited through a carefully designed cardinal- 
ity of the exploration sequence. In particular, we show that under the decentralized 
policy based on DSEE, the system regret, defined as the total reward loss with re- 
spect to the ideal scenario of known reward models and centralized collision-free 
scheduling among players, grows at the same orders as the regret in the single- 
player MAB under the same conditions on the reward distributions. These results 
hinge on the extendability of DSEE to targeting at arms with arbitrary ranks (not 
necessarily the best arm) and the sufficiency in learning the arm rank solely through 
the observations from the exploration sequence. 
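The collision-free round-robin exploration described above can be sketched as follows. This is only an illustration under stated assumptions: players are taken to hold distinct IDs 0, 1, ..., M-1 with M no larger than N, all players share the same exploration sequence, and the offset is simply the player ID (the text offers IDs as one example of a predetermined offset). The names `exploration_arm` and `has_collision` are hypothetical.

```python
def exploration_arm(slot: int, player_id: int, n_arms: int) -> int:
    """Arm (0-based) that a player explores in the given exploration slot.

    Assumption: players hold distinct IDs 0, 1, ..., M-1 with M <= n_arms,
    and every player uses the same exploration sequence.
    """
    return (slot + player_id) % n_arms


def has_collision(slot: int, n_players: int, n_arms: int) -> bool:
    """True if two players would pick the same arm in this exploration slot."""
    chosen = [exploration_arm(slot, i, n_arms) for i in range(n_players)]
    return len(set(chosen)) < n_players
```

With distinct offsets smaller than N, no two players meet on an arm during exploration, and each player still visits every arm once per N exploration slots.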

1.4. Related Work. The DSEE approach complements the classic results on
MAB as detailed in Sec. 1.1 and 1.2. In the context of decentralized MAB with 
multiple players, the problem was formulated in [7] with a simpler collision model: 
regardless of the occurrence of collisions, each player always observes the actual
reward offered by the selected arm. In this case, collisions affect only the immedi- 
ate reward but not the learning ability. It was shown that the optimal system regret 
has the same logarithmic order as in the classic MAB with a single player, and 
a Time-Division Fair sharing (TDFS) framework for constructing order-optimal 
decentralized policies using any order-optimal single-player policy as the basic 
building block was proposed. The same decentralized MAB models as in [7] were 
also considered in [8, 9] in the context of dynamic spectrum access, where order-optimal
distributed policies were established based on UCB1 proposed in [6]. In
particular, in [9], UCB1 was extended to targeting at the mth ($1 \le m \le N$) best
arm by considering both the upper confidence bound given in [6] and a symmetric
lower confidence bound. Based on this extension, decentralized policies under
both prioritized and fair access scenarios were proposed. In [10], the decentralized 
MAB with a special imperfect observation model was considered in the context of 
dynamic spectrum access. Specifically, the arm reward was drawn from the interval
[0, 1]. Under a collision, either no one gets the reward or the colliding players
share the reward evenly. Each player can only observe the received reward. Under
a non-cooperative game framework, in which each player solely aims to maximize its
own average reward, it was shown that the system achieves the maximum long-term
average reward when each player adopts the single-player policy proposed in [11].
The system regret order was not considered. 

The results presented in this paper and the related work discussed above are 
developed within the non-Bayesian framework of MAB in which the unknowns in 
the reward models are treated as deterministic quantities and the design objective is 
universally (over all possible values of the unknowns) good policies. The other line 
of development is within the Bayesian framework in which the unknowns are mod- 
eled as random variables with known prior distributions and the design objective 
is policies with good average performance (averaged over the prior distributions 
of the unknowns). By treating the posterior probabilistic knowledge (updated from 
the prior distribution using past observations) about the unknowns as the system 
state, Bellman in 1956 abstracted and generalized the Bayesian MAB to a special
class of Markov decision processes [12]. The long-standing Bayesian MAB
was solved by Gittins in the 1970s, when he established the optimality of an index
policy, the so-called Gittins index policy [13]. In 1988, Whittle generalized the
classic Bayesian MAB to the restless MAB and proposed an index policy based
on a Lagrangian relaxation [14]. Weber and Weiss in 1990 showed that the Whittle
index policy is asymptotically optimal under certain conditions [15, 16]. In the finite
regime, the strong performance of the Whittle index policy has been demonstrated in
numerous examples (see, e.g., [17, 18, 19, 20]). 

2. The Classic MAB. Consider an N-arm bandit and a single player. At each
time t, the player chooses one arm to play. Playing arm n yields i.i.d. random
reward $X_n(t)$ drawn from an unknown distribution $f_n(x)$. Let
$\mathcal{F} = (f_1(x), \cdots, f_N(x))$ denote the set of the unknown distributions.
We assume that the reward mean $\theta_n = \mathbb{E}[X_n(t)]$ exists for all $1 \le n \le N$.

An arm selection policy $\pi$ is a function that maps from the player's observation
and decision history to the arm to play. Let $\sigma$ be a permutation of $\{1, \cdots, N\}$
such that $\theta_{\sigma(1)} \ge \theta_{\sigma(2)} \ge \cdots \ge \theta_{\sigma(N)}$. The system performance under
policy $\pi$ is measured by the regret $R_T^{\pi}(\mathcal{F})$ defined as

         $R_T^{\pi}(\mathcal{F}) = T\theta_{\sigma(1)} - \mathbb{E}_{\pi}\left[\sum_{t=1}^{T} X_{\pi}(t)\right]$,

where $X_{\pi}(t)$ is the random reward obtained at time t under policy $\pi$, and $\mathbb{E}_{\pi}[\cdot]$
denotes the expectation with respect to policy $\pi$. The objective is to minimize the
rate at which $R_T^{\pi}(\mathcal{F})$ grows with T under any distribution set $\mathcal{F}$ by choosing an
optimal policy $\pi^*$. We say that a policy is order-optimal if it achieves a regret
growing at the same order as that of an optimal policy. We point out that any policy with
a sublinear regret order achieves the maximum average reward $\theta_{\sigma(1)}$.

3. The DSEE Approach. In this section, we present the DSEE approach and
analyze its performance under different conditions on the reward distributions. 

3.1. The General Structure. Time is divided into two interleaving sequences: 
an exploration sequence and an exploitation sequence. In the exploration sequence, 
the player plays all arms in a round-robin fashion. In the exploitation sequence, 
the player plays the arm with the largest sample mean calculated from past re- 
ward observations. It is also possible to use only the observations obtained in the 
exploration sequence in computing the sample means. This leads to the same re- 
gret order with a significantly lower complexity since the sample means only need 
to be updated at the same sublinear rate as the exploration sequence. A detailed 
implementation of DSEE is given in Fig 1 in which only observations from the 
exploration sequence are used in computing the sample means. 

The DSEE Approach

• Notations and Inputs: Let $A(t)$ denote the set of time indices that belong to the
  exploration sequence up to (and including) time t. Let $|A(t)|$ denote the cardinality of $A(t)$.
  Let $\bar{\theta}_n(t)$ denote the sample mean of arm n computed from the reward observations at
  times in $A(t-1)$. For two positive integers k and l, define $k \oslash l = ((k-1) \bmod l) + 1$,
  which is an integer taking values from $1, 2, \cdots, l$.

• At time t,

  1. if $t \in A(t)$, play arm $n = |A(t)| \oslash N$;

  2. if $t \notin A(t)$, play arm $n = \arg\max\{\bar{\theta}_n(t), 1 \le n \le N\}$.

Fig 1. The DSEE approach for the classic MAB.
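The two-step structure in the figure can be sketched as a small simulation. This is a hedged illustration rather than the authors' implementation: the interface (arms as callables taking a random generator), the names `run_dsee`, `log_rule` and `bernoulli`, the 0-based arm numbering, and the choice to play arm 0 when no exploration data exist yet are all assumptions made here. As in the figure, only exploration observations feed the sample means.

```python
import math
import random


def log_rule(n_arms, w):
    """Exploration rule of the form |A(t-1)| < N * ceil(w * log t)."""
    return lambda t, size: size < n_arms * math.ceil(w * math.log(t))


def run_dsee(arms, horizon, in_exploration, seed=0):
    """Simulate DSEE and return (list of played arms, |A(horizon)|).

    arms           -- list of callables rng -> reward (hypothetical interface)
    in_exploration -- callable (t, |A(t-1)|) -> bool deciding whether t joins A(t)
    """
    rng = random.Random(seed)
    n_arms = len(arms)
    sums = [0.0] * n_arms      # reward totals from exploration slots only
    counts = [0] * n_arms      # exploration plays per arm
    size = 0                   # |A(t-1)|
    played = []
    for t in range(1, horizon + 1):
        if in_exploration(t, size):
            size += 1
            arm = (size - 1) % n_arms   # round-robin index, shifted to 0-based
            reward = arms[arm](rng)
            sums[arm] += reward
            counts[arm] += 1
        else:
            # Exploitation: arm with the largest exploration-only sample mean.
            arm = max(
                range(n_arms),
                key=lambda n: sums[n] / counts[n] if counts[n] else float("-inf"),
            )
            arms[arm](rng)              # reward is collected but not learned from
        played.append(arm)
    return played, size


def bernoulli(p):
    """Bernoulli(p) reward generator used only for demonstration."""
    return lambda rng: 1.0 if rng.random() < p else 0.0
```

For instance, `run_dsee([bernoulli(0.9), bernoulli(0.1)], 1000, log_rule(2, 5.0))` spends a logarithmic number of slots on round-robin exploration and the remaining slots on the arm with the larger sample mean. Passing a different `in_exploration` predicate changes only the density of the exploration sequence, which is the tunable parameter the text emphasizes.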

In DSEE, the tradeoff between exploration and exploitation is balanced by choos- 
ing the cardinality of the exploration sequence. To minimize the regret growth rate, 
the cardinality of the exploration sequence should be set to the minimum that en- 
sures a reward loss in the exploitation sequence having an order no larger than the 
cardinality of the exploration sequence. This is explicitly stated in the following 
lemma. 

Lemma 3.1. Let $R_{T,O}^{\pi}(\mathcal{F})$ and $R_{T,I}^{\pi}(\mathcal{F})$ denote, respectively, the regret
incurred in the exploration and exploitation sequences. We have

(3.1)    $R_T^{\pi}(\mathcal{F}) = R_{T,O}^{\pi}(\mathcal{F}) + R_{T,I}^{\pi}(\mathcal{F}) = \Omega(R_{T,O}^{\pi}(\mathcal{F}))$.

There exists an order-optimal policy $\pi^*$ such that
(i) $R_T^{\pi^*}(\mathcal{F}) = O(R_{T,O}^{\pi^*}(\mathcal{F}))$,
(ii) For any policy $\pi$ with an exploration sequence of cardinality $o(R_{T,O}^{\pi^*}(\mathcal{F}))$,
we have $R_T^{\pi}(\mathcal{F}) = \Omega(R_T^{\pi^*}(\mathcal{F}))$.

Furthermore, if (i) and (ii) hold, then $\pi^*$ is an order-optimal policy.

Proof. Equation (3.1) is obvious. We consider an order-optimal policy $\pi^*$ with
a specific exploration sequence. If $R_T^{\pi^*}(\mathcal{F}) = \omega(R_{T,O}^{\pi^*}(\mathcal{F}))$, we can then increase
the cardinality of the exploration sequence to the order of $R_T^{\pi^*}(\mathcal{F})$. Since $R_{T,I}^{\pi^*}(\mathcal{F})$
is nonincreasing with the cardinality of the exploration sequence, the regret order
remains optimal and (i) holds for this order-optimal policy with a denser exploration
sequence. By observing that any exploration sequence with a cardinality
order less than the optimal one cannot lead to a better regret, we can see that (ii)
also holds. Now we assume that both (i) and (ii) hold for a policy $\pi^*$. If there exists
a policy achieving a smaller regret order compared to $\pi^*$, then (i) contradicts (ii).
This leads to the order-optimality of $\pi^*$. □

3.2. The Logarithmic Regret. In this section, we construct an exploration se- 
quence in DSEE to achieve the optimal logarithmic regret order under the following 
condition on the reward distributions. 

C1. There exist $\zeta > 0$ and $u_0 > 0$ such that $\mathbb{E}[\exp((X - \theta)u)] \le \exp(\zeta u^2/2)$
for all u with $|u| \le u_0$.

Condition C1 implies that the reward distributions have central moments up to an
arbitrary order and the diverging rate of the moment sequence is properly bounded.
C1 thus imposes constraints on the deviation of the random variable X from its
expected value $\theta$, leading to the following Chernoff-Hoeffding bound that states
the exponential convergence of the sample mean to the true mean.

Lemma 3.2. (Chernoff-Hoeffding Bound [21]) Let $\{X(t)\}_{t=1}^{\infty}$ be i.i.d. random
variables drawn from a distribution satisfying C1. Let $\bar{X}_s = (\sum_{t=1}^{s} X(t))/s$
and $\theta = \mathbb{E}[X(1)]$. We have, for all $\delta \in [0, \zeta u_0]$, $a \in (0, 1/(2\zeta)]$,

(3.2)    $\Pr(|\bar{X}_s - \theta| \ge \delta) \le 2\exp(-a\delta^2 s)$.
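As a quick sanity check of condition C1 and the bound (3.2), consider a Gaussian reward. The worked example below is an illustration added here, not taken from the paper; the variance symbol $\nu^2$ is introduced only for this example, and the first identity is the standard Gaussian moment-generating function.

```latex
% Gaussian reward X ~ N(\theta, \nu^2); \nu is a symbol introduced for this example.
\begin{align*}
\mathbb{E}\big[\exp((X-\theta)u)\big] &= \exp\!\big(\nu^2 u^2/2\big)
  \quad \text{for all } u \in \mathbb{R},
\end{align*}
% so C1 holds with \zeta = \nu^2 and any u_0 > 0. Taking a = 1/(2\zeta) in (3.2):
\begin{align*}
\Pr\big(|\bar{X}_s - \theta| \ge \delta\big) &\le 2\exp\!\Big(-\frac{\delta^2 s}{2\nu^2}\Big),
  \qquad 0 \le \delta \le \nu^2 u_0 .
\end{align*}
```

Because $u_0$ can be taken arbitrarily large in the Gaussian case, the restriction on $\delta$ is not binding there; for distributions whose moment-generating function is controlled only near zero, the range of admissible $\delta$ is what limits how the bound is applied.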

Proven in [21], Lemma 3.2 extends the original Chernoff-Hoeffding bound given 
in [22] that considers only random variables with a finite support. It is not difficult 
to show that reward distributions in the exponential family (Gaussian, Poisson, 
Laplacian, and Exponential) as considered in [4, 5, 23] or those with a finite support 
as considered in [6] satisfy C1. Assuming only C1, DSEE thus offers the optimal
logarithmic regret order for a more general set of reward distributions (e.g., the
Weibull distribution) as shown in Theorem 3.1 below.

Theorem 3.1. Construct an exploration sequence as follows. Let $a$, $\zeta$, $u_0$ be
the constants such that (3.2) holds. Choose a constant $b > 2/a$, a constant
$c \in (0, \theta_{\sigma(1)} - \theta_{\sigma(2)})$, and a constant $w > \max\{b/(\zeta u_0)^2, 4b/c^2\}$. For each $t > 1$, if
$|A(t-1)| < N\lceil w \log t \rceil$, then include t in $A(t)$. Under this exploration sequence,
the resulting DSEE policy $\pi^*$ has regret

(3.3)    $R_T^{\pi^*}(\mathcal{F}) \le C \log T$

for some constant C independent of T.
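The construction in Theorem 3.1 can be made concrete with a short sketch. The code below is an illustration under assumptions: the function names and the `slack` factor (used only to keep the strict inequalities on b and w) are hypothetical, and the numerical inputs in any usage are made up for demonstration.

```python
import math


def choose_constants(a, zeta, u0, c, slack=1.01):
    """Pick b > 2/a and w > max{b/(zeta*u0)^2, 4b/c^2} as in Theorem 3.1.

    slack > 1 is a hypothetical knob that keeps both inequalities strict.
    """
    b = slack * 2.0 / a
    w = slack * max(b / (zeta * u0) ** 2, 4.0 * b / c ** 2)
    return b, w


def log_exploration_sequence(n_arms, w, horizon):
    """Return A(horizon): slot t joins if |A(t-1)| < N * ceil(w * log t)."""
    sequence = []
    for t in range(1, horizon + 1):
        if len(sequence) < n_arms * math.ceil(w * math.log(t)):
            sequence.append(t)
    return sequence
```

Two things are visible from the sketch. First, w scales as $1/c^2$, so a small gap between the best and second best arms forces a much denser exploration sequence. Second, once the sequence has caught up with its target, its cardinality at time T equals $N\lceil w \log T \rceil$, which is the logarithmic growth the theorem relies on.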

Proof. Without loss of generality, we assume that $\{\theta_n\}_{n=1}^{N}$ are distinct. From
the construction of the exploration sequence in $\pi^*$, it is easy to see that $R_{T,O}^{\pi^*}(\mathcal{F})$
has a logarithmic order. From (3.1), it suffices to show that $R_{T,I}^{\pi^*}(\mathcal{F})$ has at most a
logarithmic order. In particular, based on the Chernoff-Hoeffding bound given in
Lemma 3.2, we show that $R_{T,I}^{\pi^*}(\mathcal{F})$ is bounded by some constant independent of
T.

During the exploitation sequence, a reward loss happens if the player incorrectly
identifies the best arm. To bound $R_{T,I}^{\pi^*}(\mathcal{F})$, we need to bound the number of learning
mistakes at the player. Let $E_k$ denote the kth exploitation period which is the
kth contiguous segment in the exploitation sequence. We have

         $R_{T,I}^{\pi^*}(\mathcal{F}) = O(\mathbb{E}[\sum_{t \notin A(T),\, t \le T} \mathbb{I}(\pi^*(t) \ne \sigma(1))])$

(3.4)    $= O(\sum_{k=1}^{\infty} \Pr(\pi^*(t) \ne \sigma(1) \text{ during } E_k)\,|E_k|)$.

In the following, we bound $\Pr(\pi^*(t) \ne \sigma(1) \text{ during } E_k)$ and $|E_k|$ respectively.
Consider $|E_k|$ first. Let $t_k > 1$ denote the starting time of the kth exploitation
period. We have

(3.5)    $|A(t_k - 1)| = N\lceil w \log t_k \rceil$.

Starting from time $t_k$, the next exploration period starts at time t if

         $N\lceil w \log(t-1) \rceil \le |A(t_k - 1)| < N\lceil w \log t \rceil$.

Combined with (3.5), we have $t \le h t_k$ for some constant h. Equivalently, we have

(3.6)    $|E_k| = t - t_k \le (h-1) t_k$.

Now we consider $\Pr(\pi^*(t) \ne \sigma(1) \text{ during } E_k)$. We have

(3.7)    $\Pr(\pi^*(t) \ne \sigma(1) \text{ during } E_k) \le \Pr(\exists\, 1 < j \le N \text{ s.t. } \bar{\theta}_{\sigma(j)}(t_k) \ge \bar{\theta}_{\sigma(1)}(t_k))$.

Let $\tau_n(t)$ denote the number of times that arm n has been played during the
exploration sequence up to time t. Recalling the parameter b defined in the theorem,
define

         $\epsilon_n(t) = \sqrt{(b \log t)/\tau_n(t)} > 0$.

For each bad arm $\sigma(j)$ ($j > 1$), we have

(3.8)    $\epsilon_{\sigma(1)}(t_k) = \epsilon_{\sigma(j)}(t_k) < (\theta_{\sigma(1)} - \theta_{\sigma(j)})/2$,

in which the inequality is due to the fact that $\tau_n(t_k) > 4b \log t_k/c^2$ for any
n ($1 \le n \le N$). From (3.8), the $\epsilon_{\sigma(1)}(t_k)$-neighborhood of the mean of arm $\sigma(1)$ is
non-overlapping with that of arm $\sigma(j)$. To bound the probability of misidentifying the
rank of arm $\sigma(1)$ and arm $\sigma(j)$, it is sufficient to bound the probability of event
$\{|\bar{\theta}_s(t_k) - \theta_s| > \epsilon_s(t_k)\}$ for $s \in \{\sigma(1), \sigma(j)\}$. We further observe that, for
$s \in \{\sigma(1), \sigma(j)\}$, $\epsilon_s(t_k) < \zeta u_0$ due to the fact that $\tau_s(t_k) > b \log t_k/(\zeta u_0)^2$. The
Chernoff-Hoeffding bound given in Lemma 3.2 is thus applicable (by choosing
$\delta = \epsilon_s(t_k)$) and we have, for $s \in \{\sigma(1), \sigma(j)\}$,

(3.9)    $\Pr(|\bar{\theta}_s(t_k) - \theta_s| > \epsilon_s(t_k)) \le 2 t_k^{-ab}$.

We can then bound (3.7) as follows.

(3.10)   $\Pr(\pi^*(t) \ne \sigma(1) \text{ during } E_k) \le g\, t_k^{-ab}$

for some constant g.

Based on (3.6) and (3.10), we bound (3.4) as follows.

         $R_{T,I}^{\pi^*}(\mathcal{F}) = O(\sum_{k=1}^{\infty} \Pr(\pi^*(t) \ne \sigma(1) \text{ during } E_k)\,|E_k|)$

(3.11)   $= O(\sum_{k=1}^{\infty} t_k^{-ab+1})$    (note that $ab > 2$)

         $= O(1)$.

We thus proved Theorem 3.1. □

We point out that the policy depends on certain knowledge about the differentiability
of the best arm. Specifically, we need a lower bound (parameter c defined
in Theorem 3.1) on the difference in the reward means of the best and the second
best arms. We also need to know the bounds on parameters $\zeta$ and $u_0$ such that the
Chernoff-Hoeffding bound (3.2) holds. These bounds are required in defining w
that specifies the minimum leading constant of the logarithmic cardinality of the
exploration sequence necessary for identifying the best arm. However, we show
that without any knowledge of the reward models, we can increase the cardinality
of the exploration sequence of $\pi^*$ by an arbitrarily small amount to achieve a regret
arbitrarily close to the logarithmic order.

Theorem 3.2. Let $g(t)$ be any positive increasing sequence with $g(t) \to \infty$
as $t \to \infty$. Revise policy $\pi^*$ in Theorem 3.1 as follows: include t ($t > 1$) in $A(t)$ if
$|A(t-1)| < \lceil g(t) \log t \rceil$. Under the revised policy $\pi'$, we have

         $R_T^{\pi'}(\mathcal{F}) = O(g(T) \log T)$.

Proof. The proof is similar to that of Theorem 3.1 up to (3.4). It is not difficult
to show that equation (3.6) still holds with a different constant. Let $b(t)$ be any
positive increasing sequence with $b(t) = o(g(t))$ and $b(t) \to \infty$ as $t \to \infty$. To
show (3.8), we choose

         $\epsilon_s(t_k) = \sqrt{(b(t_k) \log t_k)/\tau_s(t_k)}$, $s \in \{\sigma(1), \sigma(j)\}$,

which, after a certain deterministic time, define non-overlapping neighborhoods of $\theta_{\sigma(1)}$
and $\theta_{\sigma(j)}$ due to the fact that $b(t) = o(g(t))$. Based on the same fact, the Chernoff-Hoeffding
bound (3.2) is applicable by choosing $\delta = \epsilon_s(t_k)$ and (3.9) still holds (by
replacing b by $b(t)$) after a certain deterministic time. The proof is then completed
by noticing that the quantity ab (now $ab(t)$) in (3.11) becomes larger than 2 after
a certain deterministic time due to the fact that $b(t) \to \infty$ as $t \to \infty$. □

3.3. Achieving Sublinear Regret under Relaxed Conditions. In Sec. 3.2, we
adopted a condition (C1) on the tail probabilities of the reward distributions $\mathcal{F}$ to
ensure that the Chernoff-Hoeffding bound holds. That condition implies the existence
of the moments of the reward distributions at any order. In this subsection, we
consider a relaxed condition that only requires the existence of the central moments
up to a certain order:

C2. There exists a $p > 1$ such that $\mathbb{E}|X - \theta|^p < \infty$.

Under condition C2, the Chernoff-Hoeffding bound does not hold in general. A 
weaker bound on the deviation of the sample mean from the true mean was estab- 
lished in [24], as given in the lemma below. 

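Lemma 3.3. (One-Sided Bound on Boundary Crossings [24]) Let $\{X(t)\}_{t=1}^{\infty}$
be i.i.d. random variables satisfying $\mathbb{E}|X(1)|^r < \infty$ for some $1 < r < 2$ and
$\mathbb{E}(X(1)^+)^p < \infty$ for some $p > r$. Let $S_k = \sum_{s=1}^{k} X(s)$ and $\theta = \mathbb{E}[X(1)]$. We
have, for all $\alpha r > 1$ and $\epsilon > 0$,

(3.12)   $\sum_{t=1}^{\infty} t^{p\alpha - 2} \Pr(\max_{1 \le k \le t} (S_k - k\theta) > \epsilon t^{\alpha}) < \infty$.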

Based on Lemma 3.3, we have the following probabilistic bound on the deviation
of the sample mean from the true mean.

Lemma 3.4. Let $\{X(t)\}_{t=1}^{\infty}$ be i.i.d. random variables drawn from a distribution
satisfying C2. Let $\bar{X}_t = (\sum_{k=1}^{t} X(k))/t$ and $\theta = \mathbb{E}[X(1)]$. We have, for all
$\epsilon > 0$,

(3.13)   $\Pr(|\bar{X}_t - \theta| > \epsilon) = o(t^{1-p})$.

Proof. By choosing $\alpha = 1$ and an r ($1 < r < \min\{p, 2\}$) in Lemma 3.3, we
have, for all $\epsilon > 0$,

(3.14)   $\sum_{t=1}^{\infty} t^{p-2} \Pr(|\bar{X}_t - \theta| > \epsilon) < \infty$.

The double-sided bound holds since both $\mathbb{E}(X(1)^+)^p$ and $\mathbb{E}(X(1)^-)^p$ exist under
C2. By noticing that the term within the summation of the left-hand side
in (3.14) is equal to $o(t^{-1})$, we arrive at (3.13). □

Based on Lemma 3.4, we can choose the cardinality of the exploration sequence
under DSEE. In the next theorem, we show that by choosing an exploration sequence
with cardinality of $O(T^{1/p})$, a regret of order $T^{1/p}$ can be
achieved.

Theorem 3.3. Construct an exploration sequence as follows. Choose a constant
$v > 0$. For each $t > 1$, if $|A(t-1)| < v t^{1/p}$, then include t in $A(t)$. Under
this exploration sequence, the resulting DSEE policy $\pi^p$ has regret

(3.15)   $R_T^{\pi^p}(\mathcal{F}) \le D T^{1/p}$

for some constant D independent of T.
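The polynomial exploration sequence of Theorem 3.3 can be sketched in the same way as the logarithmic one. The function name `poly_exploration_sequence` and the numbers used to exercise it are assumptions for illustration only.

```python
def poly_exploration_sequence(v, p, horizon):
    """Return A(horizon): slot t > 1 joins if |A(t-1)| < v * t**(1/p)."""
    sequence = []
    for t in range(2, horizon + 1):
        if len(sequence) < v * t ** (1.0 / p):
            sequence.append(t)
    return sequence
```

Comparing `len(poly_exploration_sequence(1.0, 2, T))` with `len(poly_exploration_sequence(1.0, 4, T))` for the same horizon shows the dependency noted earlier: the fewer moments are assumed to exist (smaller p), the denser the exploration sequence has to be.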

Proof. Without loss of generality, we assume that $\{\theta_n\}_{n=1}^{N}$ are distinct. Based
on (3.1), it is sufficient to show that $R_{T,I}^{\pi^p}(\mathcal{F}) = o(T^{1/p})$. Choose
$\epsilon \in (0, \frac{1}{2}\min\{\theta_{\sigma(i)} - \theta_{\sigma(j)} : 1 \le i < j \le N\})$. For a t that belongs to the
exploitation sequence, define the following event

         $\mathcal{E}(t) = \{|\bar{\theta}_n(t) - \theta_n| \le \epsilon,\ \forall\, 1 \le n \le N\}$.

On event $\mathcal{E}(t)$, the player correctly identifies the best arm, i.e., the regret is zero. The
regret incurred in the exploitation sequence is thus at the order of the number of
time instances at which event $\mathcal{E}(\cdot)$ does not happen. We thus have

         $R_{T,I}^{\pi^p}(\mathcal{F}) = O(\sum_{t \notin A(T),\, t \le T} \Pr(\overline{\mathcal{E}(t)}))$

         $= O(\sum_{t \notin A(T),\, t \le T} \sum_{n=1}^{N} \Pr(|\theta_n - \bar{\theta}_n(t)| > \epsilon))$

(3.16)   $= O(\sum_{t \notin A(T),\, t \le T} \sum_{n=1}^{N} o(|A(t)|^{1-p}))$,

where the last equality is due to Lemma 3.4.

By the construction of the exploration sequence, for any $t \notin A(t)$, we have
$|A(t)| \ge v t^{1/p}$. From (3.16), we have

(3.17)   $R_{T,I}^{\pi^p}(\mathcal{F}) = O(\sum_{t=1}^{T} o(t^{(1-p)/p}))$.

We further note that

(3.18)   $\sum_{t=1}^{T} o(t^{(1-p)/p}) = o(\int_{t=1}^{T} t^{(1-p)/p}\, dt) = o(T^{1/p})$.

From (3.17) and (3.18), we have

         $R_{T,I}^{\pi^p}(\mathcal{F}) = o(T^{1/p})$.

We thus proved the theorem. □
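For readers who want the integral step behind the $T^{1/p}$ order spelled out, the following short calculation may help. It is a worked illustration added here, using only elementary calculus and the assumption $p > 1$ from C2.

```latex
% Evaluation of the integral of t^{(1-p)/p}, valid for p > 1.
\begin{align*}
\int_{1}^{T} t^{(1-p)/p}\,dt
  = \int_{1}^{T} t^{\frac{1}{p}-1}\,dt
  = p\left(T^{1/p} - 1\right)
  \le p\,T^{1/p}.
\end{align*}
% Since t^{(1-p)/p} is decreasing in t for p > 1, the sum is bounded by
% its first term plus the integral:
\begin{align*}
\sum_{t=1}^{T} t^{(1-p)/p}
  \le 1 + \int_{1}^{T} t^{(1-p)/p}\,dt
  \le 1 + p\,T^{1/p}
  = O\big(T^{1/p}\big).
\end{align*}
```

The exploitation regret is therefore of strictly smaller order than $T^{1/p}$, so the total regret is dominated by the $O(T^{1/p})$ cost of the exploration sequence itself.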

4. Variations of MAB. In this section, we extend the DSEE approach to vari- 
ations of MAB including MAB under various objectives and decentralized MAB 
with corrupted reward observations. 

4.1. MAB under Various Objectives. Consider a generalized MAB problem in 
which the desired arm is the mth best arm for an arbitrary m. Such objectives 
may arise when there are multiple players (see the next subsection) or other constraints/costs
in arm selection. The classic policies in [4, 5, 6] cannot be directly
extended to handle this new objective. For example, for the UCB1 policy proposed
by Auer et al. in [6], simply choosing the arm with the mth ($1 \le m \le N$) largest
index cannot guarantee an optimal solution. This can be seen from the index form
given in (1.1): when the index of the desired arm is too large to be selected, its index
tends to become even larger due to the second term of the index. The rectification 
proposed in [9] is to combine the upper confidence bound with a symmetric lower 
confidence bound. Specifically, the arm selection is completed in two steps at each 
time: the upper confidence bound is first used to filter out arms with a lower rank, 
the lower confidence bound is then used to filter out arms with a higher rank. It was 
shown in [9] that under the extended UCB1, the expected time that the player does
not play the targeted arm has a logarithmic order.

The DSEE approach, however, can be directly extended to handle this general 
objective. Under DSEE, all arms, regardless of their ranks, are sufficiently explored 
by carefully choosing the cardinality of the exploration sequence. As a 
consequence, this general objective can be achieved by simply choosing the arm with 
the mth largest sample mean in the exploitation sequence. Specifically, assume 
that a cost $C_j > 0$ ($j \neq m$, $1 \le j \le N$) is incurred when the player plays the jth 
best arm. Define the regret $R_T^{\pi}(\mathcal{F}, m)$ as the expected total costs over time T under 
policy $\pi$. 
Theorem 4.1. By choosing the parameter c in Theorem 3.1 to satisfy 
$0 < c < \min\{\theta_{\sigma(m-1)} - \theta_{\sigma(m)},\ \theta_{\sigma(m)} - \theta_{\sigma(m+1)}\}$ and letting the player select the 
arm with the m-th largest sample mean in the exploitation sequence, Theorem 3.1, 
Theorem 3.2 and Theorem 3.3 hold for $R_T^{\pi}(\mathcal{F}, m)$. 

Proof. The proof is similar to those of previous theorems. The key observation 
is that after playing all arms sufficiently many times during the exploration sequence, the 
probability that the sample mean of each arm deviates from its true mean by an 
amount larger than its non-overlapping neighbor (see (3.8)) is small enough to 
ensure a properly bounded regret incurred in the exploitation sequence. □ 
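
The exploitation rule of Theorem 4.1 can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names, the example values, and the handling of the boundary ranks (a missing neighbor in rank is simply ignored when computing the gap) are assumptions. `arm_with_mth_largest_mean` picks the arm to play in the exploitation sequence, and `rank_gap` evaluates the quantity that the parameter c must stay below.

```python
def arm_with_mth_largest_mean(sample_means, m):
    """Return the index of the arm whose sample mean is the m-th largest (m = 1 is the best)."""
    if not 1 <= m <= len(sample_means):
        raise ValueError("m must lie between 1 and the number of arms")
    ranked = sorted(range(len(sample_means)),
                    key=lambda n: sample_means[n], reverse=True)
    return ranked[m - 1]


def rank_gap(true_means, m):
    """Smaller of the two gaps between the m-th best mean and its neighbors in rank.

    Assumption for illustration: when m is the best or worst rank, the
    missing neighbor is ignored.
    """
    ordered = sorted(true_means, reverse=True)
    gaps = []
    if m > 1:
        gaps.append(ordered[m - 2] - ordered[m - 1])
    if m < len(ordered):
        gaps.append(ordered[m - 1] - ordered[m])
    if not gaps:
        raise ValueError("at least two arms are needed to define a gap")
    return min(gaps)


# Hypothetical sample means computed from the exploration sequence.
sample_means = [0.21, 0.88, 0.52, 0.69]
print(arm_with_mth_largest_mean(sample_means, 2))  # arm index 3 holds the 2nd largest mean

# Hypothetical true means; c must be chosen below this gap.
print(rank_gap([0.2, 0.9, 0.5, 0.7], 2))
```

In practice the true means are unknown, so `rank_gap` only indicates what a valid lower bound on the mean differences would have to satisfy.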

We now consider an alternative scenario in which the player targets a set of best 
arms, say the M best arms. We assume that a cost is incurred whenever the player 
plays an arm not in this set. Similarly, we define the regret $R_T^{\pi}(\mathcal{F}, M)$ as the 
expected total costs over time T under policy $\pi$. 

Theorem 4.2. By choosing the parameter c in Theorem 3.1 to satisfy 
$0 < c < \theta_{\sigma(M)} - \theta_{\sigma(M+1)}$ and letting the player select one of the M arms with the 
largest sample means in the exploitation sequence, Theorem 3.1, Theorem 3.2 and 
Theorem 3.3 hold for $R_T^{\pi}(\mathcal{F}, M)$. 

Proof. The proof is similar to those of the previous theorems. Compared to 
Theorem 4.1, the condition on c for applying Theorem 3.1 is more relaxed: we 
only need to know a lower bound on the mean difference between the M-th best 
and the (M+1)-th best arms. This is due to the fact that we only need to distinguish 
the M best arms from the rest instead of specifying their rank. □ 
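
The relaxed condition in Theorem 4.2 can be illustrated with a short Python sketch. The function names and the numerical values are hypothetical. In the example the two best arms are nearly tied, which would make their individual ranks hard to learn, yet the gap that matters for identifying the set of the M = 2 best arms is large.

```python
def estimated_best_set(sample_means, M):
    """Return the indices of the M arms with the largest sample means."""
    if not 1 <= M <= len(sample_means):
        raise ValueError("M must lie between 1 and the number of arms")
    ranked = sorted(range(len(sample_means)),
                    key=lambda n: sample_means[n], reverse=True)
    return set(ranked[:M])


def set_gap(true_means, M):
    """Difference between the M-th best and the (M+1)-th best mean."""
    if not 1 <= M < len(true_means):
        raise ValueError("M must be at least 1 and smaller than the number of arms")
    ordered = sorted(true_means, reverse=True)
    return ordered[M - 1] - ordered[M]


# Hypothetical true means: the two best arms differ by only 0.02.
true_means = [0.90, 0.88, 0.50, 0.20]
print(set_gap(true_means, 2))  # gap between the 2nd and 3rd best means

# Hypothetical sample means from the exploration sequence.
print(estimated_best_set([0.91, 0.87, 0.49, 0.22], 2))
```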

By selecting arms with different ranks of the sample mean in the exploitation 
sequence, it is not difficult to see that Theorem 4.1 and Theorem 4.2 can be ap- 
plied to cases with time-varying objectives. In the next subsection, we use these 
extensions of DSEE to solve a class of decentralized MAB with imperfect reward 
observations. 

4.2. Decentralized MAB with Corrupted Reward Observations. 

4.2.1. Distributed Learning and Its Applications. Consider M distributed 
players. At each time t, each player chooses one arm to play. When multiple 
players choose the same arm (say, arm n) to play at time t, a player (say, player 
m) involved in this collision obtains a potentially reduced reward $Y_{n,m}(t)$ with 
$\sum_{m=1}^{M} Y_{n,m}(t) \le X_n(t)$. The distribution of $Y_{n,m}(t)$ can take any unknown form 
and may depend arbitrarily on n, m and t. Players make decisions solely based on 
local reward observations without information exchange. Consequently, a player 
does not know whether it is involved in a collision, or equivalently, whether the 
received reward reflects the true state ($X_n(t)$) of the arm. 

A local arm selection policy $\pi_m$ of player m is a function that maps from the 
player's observation and decision history to the arm to play. A decentralized arm 
selection policy $\pi_d$ is thus given by the concatenation of the local policies of all 
players: 

\[ \pi_d = [\pi_1, \ldots, \pi_M]. \]

The system performance under policy $\pi_d$ is measured by the system regret $R_T^{\pi_d}(\mathcal{F})$, 
defined as the expected total reward loss up to time T under policy $\pi_d$ compared to 
the ideal scenario in which collisions among the players are avoided through 
centralized scheduling and $\mathcal{F}$ is known to all players (thus the M best arms with highest 
means are played at each time). We have 

\[ R_T^{\pi_d}(\mathcal{F}) = T \sum_{n=1}^{M} \theta_{\sigma(n)} - \mathbb{E}_{\pi_d}\Big[\sum_{t=1}^{T} Y_{\pi_d}(t)\Big], \]

where $Y_{\pi_d}(t)$ is the total random reward obtained at time t under decentralized 
policy $\pi_d$. Similar to the single-player case, any policy with a sublinear order of 
regret would achieve the maximum average reward given by the sum of the M 
highest reward means. 

One potential application of the problem is dynamic spectrum access in which 
M secondary users independently search for spectrum opportunities among N 
channels. The state/reward, 0 (busy) or 1 (idle), of each channel is modeled as 
an i.i.d. Bernoulli process with unknown mean. At each time, a secondary trans- 
mitter chooses a channel to sense and subsequently transmits in this channel if the 
channel is sensed to be idle. Sensing is assumed to be imperfect: a false alarm or a 
miss detection can happen at each time. When multiple users transmit in the same 
channel, they collide and no one transmits successfully (i.e., no one gets reward). 
The distribution of the reward received and observed by a user thus depends on the 
number of players that sensed the same channel, which is unknown and may also 
be time-varying. If a transmission is successful, the receiver sends an ACK back 
to its transmitter at the end of the transmission. Each secondary transmitter and 
its receiver need to use the common reward observations (i.e., ACKs) in decision 
making to ensure synchronous channel selections. The problem thus falls into the 
imperfect observation model considered here. 

Another potential application is multi-agent systems in which M agents search 
or collect targets in N locations. When multiple agents choose the same location, 
they share the reward in an unknown way that may depend on which player comes 
first or the number of colliding players. 
4.2.2. Decentralized Policies under DSEE. In order to minimize the system 
regret, it is crucial that each player extracts reliable information for learning the arm 
rank. This requires that each player collects sufficient observations that are known 
to have been obtained without collisions. As shown in Sec. 3, efficient learning 
can be achieved in DSEE by solely utilizing the observations from the exploration 
sequence. Based on this property, a decentralized arm selection policy can be con- 
structed as follows. Players play all arms in a round-robin fashion with different 
offsets in the exploration sequence, in which collisions can be eliminated and reli- 
able learning achieved. In the exploitation sequence, each player plays the M arms 
with the largest sample means calculated using only local observations from the 
exploration sequence. Specifically, each player distributes the exploitation time to 
the estimated M best arms based on either a prioritized sharing scheme or a fair 
sharing scheme. Note that under a prioritized scheme, each player needs to learn 
the specific rank of one or multiple of the M best arms and Theorem 4.1 can be 
applied. Under a fair sharing scheme, by contrast, a player only needs to learn the set of the 
M best arms (as addressed in Theorem 4.2) and use the common arm index for fair 
sharing. An example based on a round-robin fair sharing scheme is illustrated in 
Fig. 2. If the arm index is not common to all players, the entire ranks of the M best 
arms need to be learned to achieve fair sharing, e.g., using a round-robin schedule 
with different offsets for playing the arms ordered by their ranks. We point out that 
under a fair sharing scheme, each player achieves the same average reward at the 
same rate. 
Fig 2. An example of decentralized policies based on DSEE (M = 2, N = 3, the index of the 
selected arm at each time is given). 
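
A round-robin schedule with player-specific offsets can be sketched as follows. This Python example is a hypothetical instance for M = 2 players and N = 3 arms, not a reproduction of Fig. 2: the specific offset rule (slot index plus player index, taken modulo the number of arms), the function names, and the assumption that both players have arrived at the same estimated set of best arms are all choices made for illustration.

```python
def exploration_arm(player, slot, num_arms):
    """Arm played by `player` at the slot-th exploration time (round-robin with offset)."""
    return (slot + player) % num_arms


def exploitation_arm(player, slot, best_set):
    """Arm played by `player` at the slot-th exploitation time under fair sharing."""
    ordered = sorted(best_set)  # the common arm index gives all players the same order
    return ordered[(slot + player) % len(ordered)]


num_players, num_arms = 2, 3
best_set = {0, 2}  # hypothetical estimate, assumed identical at both players

# One row per time slot, one entry per player.
explore = [[exploration_arm(k, j, num_arms) for k in range(num_players)]
           for j in range(6)]
exploit = [[exploitation_arm(k, j, best_set) for k in range(num_players)]
           for j in range(6)]

print(explore)  # no row repeats an arm, so exploration is collision-free
print(exploit)  # players alternate over the estimated best arms without colliding
```

The schedule is collision-free as long as the number of players does not exceed the number of arms (for exploration) or the size of the estimated best set (for exploitation), and every player spends the same fraction of exploitation time on each arm of the set.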
Theorem 4.3. Under a decentralized policy based on DSEE, Theorem 3.1, 
Theorem 3.2 and Theorem 3.3 hold for $R_T^{\pi_d}(\mathcal{F})$. 

Proof. It is not difficult to see that the regret in the decentralized policy is 
completely determined by the learning efficiency of the M best arms at each player. 
A detailed proof is thus similar to those of previous theorems. □ 

5. Conclusion. The DSEE approach addresses the fundamental tradeoff be- 
tween exploration and exploitation in MAB by separating, in time, the two often 
conflicting objectives. It has a clearly defined tunable parameter, the cardinality 
of the exploration sequence, which can be adjusted to handle reward 
distributions with heavy tails and the lack of any prior knowledge on the reward models. 
Furthermore, the deterministic separation of exploration from exploitation allows 
easy extensions to variations of MAB, including MAB problems under various 
objectives and with multiple distributed players. 